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Abstract 

Computational difficulties in the general application of Bretthorsts 
formalism to time-series problems, posed by the large number of possible 
models and the use of models with nonorthogonal base-functions are dis- 
cussed. The specific problem under consideration is a Bayesian procedure 
for model selection, parameter estimation, and classification, that was ap- 
plied to the search for the In Vivo T2 decay rate distributions in brain 
tissues. Through the estimation of the meta-parameter a in the process, 
we also gain a better understanding of the meaning and estimation of 
"noise" in the frame- work of probability theory as logic. 

1 Introduction 

Probability theory as logic establish the unique procedure with which one can 
find the probability assignment to any well-posed inference problem. Nonethe- 
less, this procedure may still be difficult to compute. In |Q, Bretthorst present 
a mathematical frame-work in which time-series problems can be treated: In 
this frame-work, Each of our models is a linear combination of base functions, 
that is, the model M a as a function of the time t, is of the form: 

m a 

M a {t)=J2B ja G ja (t,fj2) (1) 
j'=i 

Where: 

• fjZ is the set of the nonlinear parameters of the base function Gj a . 

• m a is the number of base functions appearing in model a. 

• Bj a is the amplitude of base function Gj a . 
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The measurements are described by: 



di = M a {U) + ei 



where e; is the noise realized in measurement I. Given the model and a 
parameter set, the difference between a datum value and the value (of this 
datum) predicted by the model is assigned to the noise. 

We discuss here the case of Gaussian assignment to the noise, and discuss 
the arising computational implications. Indeed, the assignment of Gaussian 
probability distribution to the noise is a special case, but a very important one: 
arising from maximum entropy considerations, it is the proper encoding of our 
ignorance regarding non-systematic effects with known/finite variance Q. 

The probability of getting the value of a datum point, given the Gaussian 
probability distribution of the noise, its variance a s , and all the model parame- 
ters, is: 



For the sake of clarity, from now on we omit the index a that signified the 
model M a and the explicit symbol of the model M a . We do everything inside 
the model under consideration. 

We deal with the general case of multiple measurements. That is, each time- 
series was sampled more than once. The noise is assumed uncorrelated, and thus 
the probability of getting this data is the product of the probability of getting 
each of the data points: 




p(D/rta s I) = U Y[p(d u /^a s I) 



(2^)-^ exp(-— j - E B J G ^ ^) 2 ) 



s i=i i=\ j=i 
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and 

• m — Number of model base functions. 

• n - Number of sampling points. 

• ni - Number of measurements of each time-series 

Equation (g) is the key expression to be computed in any calculation of 
probabilities in time series problems with Gaussian noise. 

In order to understand the implications of having multiple measurements 
and the relationship between the part of noise arising from the misfits of the 
data to the model and the part of noise arising from the non-systematic effects, 
wc let: 

2 ° 2 s 
(J = — — 

ni 

and use this quantity from now on. So, 

p(D/^aI) = {2^n l a 2 )-^ L exp(-^) (3) 

m n mm 

j=i i=i j=i fe=i 

Where: 



di = — > du 

1=1 

We see that di, the average of the data points for every sampling time i, 
is sufficient for finding the maximum of the likelihood. It is not sufficient to 
establish the absolute value of the likelihood at any point of the parameter 
space. 

Nevertheless, in case a is known, di is sufficient for any calculation of the 
parameters' values, since the likelihood dependence on the more detailed distri- 
bution of the data is through a constant factor, d 2 . 

As common sense suggests, if a is unknown, the extent to which the data 
points are distributed will determine the probability distribution of a. The 
probability distribution of cr, in turn, will not influence the values of the pa- 
rameters at maximum likelihood, but will determine the width of the likelihood 
and of the probability distributions for the parameters. 
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In order to compute the probability of the model P(M/DI) for the model 
comparison and estimation of the parameters, one must integrate the likelihood 
given in Equation^) over all the model parameters. 

Having the base functions orthogonal, in the sense that: 



will enable us to analytically integrate the likelihood over the linear param- 
eters. Moreover, in case the base functions are orthonormal, the expectation 
values of their amplitudes are given directly by the projection of the data on 



In spite of the general dependence of the base functions on nonlinear parame- 
ters, the base functions will be, in many important problems, almost orthogonal. 
For example, in the problem of evenly sampled multiple stationary frequencies, 
with cosines and sines as the base functions, the off diagonal terms of gjk will 
be negligible as long as the frequencies are well separated 0] . 

In this paper we demonstrate the application of Bretthorsts' formulation for 
a rather pathological problem, in which the matrix gjk is never orthogonal. The 
computational difficulties, arising from the existence of nonlinear parameters in 
our problem, lead us to consider computation-reduction tools: 

First, arranging the models into a multi-dimensional model space, in which 
a search can be performed instead of impractical calculation of the probability 
of all the models. Second, the Bretthorst formulation teach us how to analyt- 
ically integrate over the linear parameters, but we still have to integrate over 
the nonlinear parameters numerically or through approximations. Numerical 
integrations are usually impractical in spaces of many nonlinear parameters. 
We are left with the task of finding the transformation of the likelihood (as a 
function of the nonlinear parameters) into forms that can be approximated with 
satisfactory precision. 

Analyzing a multiple-measurements problem, we also note about the "noise" 
and its meaning in the Bayesian framework. In cases of multiple-measurements, 
we can rarely estimate the noise by using statistics like the Standard Deviation. 

2 Posing the Problem 
2.1 The Data 

Using a 32 echo CPMG MR imaging sequence on a 1.5T GE clinical MR scan- 
ner, spin-spin relaxation (T2) decay curves were acquired from the brains of 11 
normal humans [pi . A decay curve is the collection of the amplitudes of the 
same pixel in the image through 32 consecutive images. Accordingly, each decay 
curve contains amplitudes of 32 points in time, from 10 to 320ms (Figure (|l|)). 
The partition to specific tissues was assigned by a neurologist on the images of 



n 




i=l 



the base functions jj] . 
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Figure 1: Sample of the Raw Data. Different symbols signify data taken from 
different humans. 

ten brains. For each of twelve tissues, five of white matter and seven of gray 
matter, 800-5000 decay curves had been collected. The eleventh brain's data 
provides the new data for the classification stage. 

It is clear from the Bayesian standpoint, that we lose information by using 
this data instead of the raw signal collected by the scanner. The available data 
is the result of the imaging reconstruction, including FFT, and thus suffers from 
FFT artifacts. Moreover, after assigning the specific tissues, the localization of 
the pixels' data is lost, preventing us from taking possible tissue assignment 
errors and inter-tissue contamination into account. Nevertheless, in our formu- 
lation of the problem, we regard this data as the raw data. The information 
loss will contribute to the "noise" . 

2.2 Background Information 
2.2.1 The Set of Models 

The mechanism that produced the decay curves is modeled by: 

/*°° o *j 

di = / f(T 2 ) • e ^dT 2 + [a+ [bt + [ct 2 ]}] + d ■ (-If ■ e~™ + Noise 

Jo 

Where i=1..32 

• The T 2 Distribution f{T 2 ) is expected to be composed of a few discrete 
components. These components are parameterized by their amplitude, 
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Figure 2: Example of possible /(T2), the distribution of the T 2 decay rate. 



decay rate, and possibly width: 

J_fc k 1 (T 2 n-T 2 ) 2 

/(T 2 ) = 5> m <$(T 2 m - T 2 ) + Y j a n ^==—-e «^ 

A possible T 2 distribution is given in Figure (||). 

• Polynomial components may exist, and their amplitudes are not known. 

• The third term, an Alternating-Echo-Refocusing (AER) exists, its ampli- 
tude not known. 

So, our models can be arranged in a 3-D model space. Model Mjki will have 

• j Decaying components. We assume no more than 7 such components. 

• k Decaying components with wide distribution of the decay rate. 

• 1 Polynomial components. We assume no more than 3 such components. 

2.2.2 Noise 

The noise, in the Bayesian interpretation, is not a 'random' process, but merely 
a process where our ignorance regarding the producing mechanism is such that 
its effect on the data can not be anticipated exactly. Any known behaviour 
of the producing mechanism can, in principle, be extracted from the noise and 
incorporated into the model. The noise will include any effects, systematic or 
not, that are present in the data but not in the model. The non-systematic 
effects in our case, are expected to be of finite variance but are otherwise not 
known. Accordingly, we assign the noise a Gaussian distribution with unknown 
variance 
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2.2.3 Prior Information 

We assume no preference to any model in the model space, and ignorance re- 
garding the values of the models' parameters. The numerical representation 
of this ignorance is assigned by considering the amplitudes of any of our base 
functions as location parameters (leading to flat priors), and the decay rates 
and widths of components as scale parameters (leading to Jeffreys' priors). One 
may argue that the ignorance regarding some of the parameters should be rep- 
resented differently; nonetheless, as long as we keep our priors uninformative, 
the influence of our ignorance representation on the final results is negligible. 
The priors for all the models are assigned equal, by indifference. 



2.3 The Questions 

The questions we ask are always regarding the posterior probability (or pdf) of 
the proposition we are interested in. 

2.3.1 Model Selection and Parameter Estimation 

For each tissue, which model Mju has the highest posterior probability p(Mjki / D t i SSU eI)^ 
Given that model, what are the the most probable values of the models' param- 
eters? 



2.3.2 Classification 

After inferring about the T2 distribution of each of the tissues, using a collection 
of data sets, we are interested in the following question: Given a new set of data 
dnew, we want to find the tissue it came from. We are looking for the tissue i 
that will maximize the probability p(Ci/d new DiI) where Cj = "The new data 
was produced by the same mechanism that produced the old data set Di" This 
mechanism is described by a model (a functional form) and values of the model 
parameters. 



2.4 The Process: Model Selection and Parameter Estima- 
tion 



Section ( 2.2.1 ) defined the set of models to be tested. However, we do not wish 
to calculate the needed posterior probability p(Mjki /DI) for all these models. 
Though their number is finite, the computation time needed for computing all 
of them will render our task impractical. Our algorithm computation time is 
polynomial in the number of linear parameters and exponential in the number 
of nonlinear parameters []. We need to find the most probable model without 

1 This is since the number of linear parameters determines the dimension of matrices to 
be diagonalized, and the number of nonlinear parameters, on the other hand, determines the 
dimension of space to be searched. 
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Figure 3: Search Path Flow Chart. The algorithm is looking for the maximum 
probability in the space of models. 



calculating all the models, and in particular we want to avoid unnecessary com- 
putations of models with many nonlinear parameters. In order to achieve that, 
we describe our model space as 3-D discrete space, using the j,k,l coordinates 



defined in Section ( |2.2.1| ), and search this space for the most probable model. 

A flow chart of the search path is given in Figure (||), and the model space 
with example of search in it is given in Figure (Q). 

We now turn to the more detailed computation of each model. The posterior 
probability of model Mj k i is given by: 

p{M jk i/DiI) = J J p{M m ® m $ jk i/diI)d& jH d§ jkl (4) 



a J J p{d l /M jkl Q jkl <$> jkl I)dQ 3k id$ jk i 

Where &j k i and <&j k i are the linear and nonlinear parameter sets of model 
Mj ki , respectively. 

We use the formulation of BretthorstQ|: The integration over the linear 
parameters is done via an orhtogonalizing transformation of the base functions. 
This transformation depends on the nonlinear parameters. 

In order to integrate over the non-linear parameters by quadratic approxi- 
mation, we transform the non-linear parameters to a new set of parameters in 
which the Log [likelihood] is sufficiently quadratic around its peak. In our case, 
this transformation is the logarithm (Figure (||)). 

In Figure (^) we bring a sample of the results of applying Equation (|J) to the 
data collected from different tissues. The figure shows the models that had been 
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Figure 4: Search in the Space of Models. The second-best model is not neces- 
sarily a neighbour of the best model. 

selected for particular tissues and the estimated parameters for these tissues. 
The search for the models always started from a simple model of one decaying 
exponent. The credible regions for the parameters cannot be seen in the figure. 

2.5 The Process: Classification 

The posterior probability of Ci is just the summation, over all possible models 
(and the values of their parameters) , of the likelihood of the model (and the 
values of its parameters) in light of the new data, weighted by the probability 
of that model (and the values of its parameters) given the old data set D tissue : 

issue 

jkl J J 

We use Equation fl2.5| ) to classify each pixel in a 32-image-set, generating a 
synthetic image, in which different classified tissues are represented by different 
grey levels (Figure 



9 



2.6 Discussion 



2.6.1 Search in a multi-dimensional model space 

In order to efficiently find the most probable model, avoiding unnecessary com- 
putation of complex models, we formulated a search routine in a 3-D model 
space, including stopping conditions. The arrangement of the models into the 
particular was not forced by the nature of the problem and is a matter of deci- 
sion. The search routine used in this work found to be robust for the T2 problem 
as formulated, but is definitely an ad-hoc development. The general problem 
of establishing a search routine in the model space is difficult. It is not clear 
what transformation should be used in order to simplify the topology and avoid 
many local maxima, as demonstrated in Figure (|J). This transformation will in 
general depend on the priors of the models, the priors for the parameters values, 
the structure of the data, and the initial model attributes that span the space. 



2.6.2 Integration over nonlinear parameters 

In general, the nonlinear parameters cannot be eliminated by analytical integra- 
tion, and for cases of many coupled parameters, cannot be integrated numeri- 
cally either. In most cases, we have to approximate the likelihood by another 
function, for which analytical integration is feasible. 

The normal form of the likelihood which results from the Gaussian assign- 
ment to the noise leads us to look for quadratic approximations: The behaviour 
of Log [Likelihood] as function of any linear parameter is quadratic. In fact, 
the Bretthorst method for orthogonalization and integration over the linear pa- 
rameters is equivalent to the integration over quadratic approximation of the 
function. 

For the nonlinear parameters, the problem is reduced to the general problem 
of linear approximation; that is, finding the transformation of the nonlinear 
parameters that will bring the Log [likelihood] to a form which is sufficiently 
quadratic around its peak. The transformation used in the problem of the T2 
distribution was not found in any consistent procedure, but was suggested by 
the nature of the specific problem as analyzed in previous, 'frequentist', works. 
The possible resulting errors in the final posterior probabilities for the models 
were calculated and a "safety valve routine" was applied in order to ensure 
detection and correction of cases in which the Log [likelihood] did not transform 
into a quadratic enough function. 



2.6.3 Noise 

Our likelihood is a product of terms of the form: 

1 (Mj-dj) 2 

p(d t /ae<PI) = -== ■ e 

V 27TIJ s 
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In many case we are tempted to approximate a s using the standard deviation 
of the data. We recognize that in the frame-work of probability theory as logic 
the 'noise' characterize the deviation of the individual data points from model 
and not from the mean of the data as the STD does. Nevertheless, assuming 
our model is close enough to the data mean, we hope it will serve as a sufficient 
approximation. 

Alas, In the case of multiple measurements, a s does not characterize the de- 
viation of the individual data points from our model ("The Noise Level"). In 
the case of systematic effects not accounted for by the model, the estimation of 
o~ s will diverge in the limit of many measurements. In this limit, the quantity 
^= (assigned the symbol a in this work) will converge to a finite number, char- 
acterizing the magnitude of the systematic effects in the data not accounted for 
by the model. 
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Figure 5: Transformation of the Log [Likelihood] into Quadratic Form. 
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Figure 6: Results of Model Selection and Parameter Estimation for White Mat- 
ter Tissues (left) and Gray Matter Tissues (right). 
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Figure 7: Results of Classification: The MR1 Image (left) and the Classified 
Synthetic Image (right). Each gray level in the Synthetic Image corresponds to 
certain tissue. In the data used for this work, the variance among humans was 
larger than the differ ence between tissues. The resulting image fails to classify 
closely behaving tissues, though the discrimination between White and Gray 
Matter is reasonable. 
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